knitr::include_graphics("tree_output.pdf")
data <- read.delim("SENIC.txt", header = FALSE, sep = "")
names(data) <- c("Identification Number", "Length of Stay", "Age", "Infection Risk", "Routine Culturing Ratio", "Routine Chest X-ray Ratio", "Number of Beds", "Medical School Affiliation", "Region", "Average Daily Census", "Number of Nurses", "Available Facilities & Services")
| Identification Number | Length of Stay | Age | Infection Risk | Routine Culturing Ratio | Routine Chest X-ray Ratio | Number of Beds | Medical School Affiliation | Region | Average Daily Census | Number of Nurses | Available Facilities & Services |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7.13 | 55.7 | 4.1 | 9.0 | 39.6 | 279 | 2 | 4 | 207 | 241 | 60 |
| 2 | 8.82 | 58.2 | 1.6 | 3.8 | 51.7 | 80 | 2 | 2 | 51 | 52 | 40 |
| 3 | 8.34 | 56.9 | 2.7 | 8.1 | 74.0 | 107 | 2 | 3 | 82 | 54 | 20 |
| 4 | 8.95 | 53.7 | 5.6 | 18.9 | 122.8 | 147 | 2 | 4 | 53 | 148 | 40 |
| 5 | 11.20 | 56.5 | 5.7 | 34.5 | 88.9 | 180 | 2 | 1 | 134 | 151 | 40 |
| 6 | 9.76 | 50.9 | 5.1 | 21.9 | 97.0 | 150 | 2 | 2 | 147 | 106 | 40 |
| 7 | 9.68 | 57.8 | 4.6 | 16.7 | 79.0 | 186 | 2 | 3 | 151 | 129 | 40 |
| 8 | 11.18 | 45.7 | 5.4 | 60.5 | 85.8 | 640 | 1 | 2 | 399 | 360 | 60 |
| 9 | 8.67 | 48.2 | 4.3 | 24.4 | 90.8 | 182 | 2 | 3 | 130 | 118 | 40 |
| 10 | 8.84 | 56.3 | 6.3 | 29.6 | 82.6 | 85 | 2 | 1 | 59 | 66 | 40 |
outlier_indices <- function(column) {
q1 <- quantile(column)[2]
q3 <- quantile(column)[4]
return(column > q3+1.5*(q3-q1) | column < q1-1.5*(q3-q1))
}
Above figure shows 6 outliers on infection risk, three of them are grater than than q3+1.5(q3-q1), while the others are below q1-1.5(q3-q1). Infection risk appears to be approximately normally distributed with a mean around 4.5.
Above figure shows a positive relationship between Infection Risk and Number of Nurses, due to color scale saturation it’s hard to examine the graph.
A potential danger of using such color scale for the Number of Beds is that the outliers can have a big impact on the scale, which can make it hard to distinguish variation in Number of Beds among the other observations. In the figure above it can be hard to distinguish an observation with less than 100 beds from an observation with 200 beds.
Changing to ggplotly made it easy to navigate the density graph and shows what are the outliers specifically.
data <- read.delim("SENIC.txt", header = FALSE, sep = "")
data <- data[,c(2:7, 10:12)]
names(data) <- c("Length of Stay", "Age", "Infection Risk", "Routine Culturing Ratio", "Routine Chest X-ray Ratio", "Number of Beds", "Average Daily Census", "Number of Nurses", "Available Facilities & Services")
outlier_indices <- function(column) {
q1 <- quantile(column)[2]
q3 <- quantile(column)[4]
return(column > q3+1.5*(q3-q1) | column < q1-1.5*(q3-q1))
}
ui <- fluidPage(
checkboxGroupInput("variable", "Select variables:",
c("Length of Stay", "Age", "Infection Risk", "Routine Culturing Ratio", "Routine Chest X-ray Ratio", "Number of Beds", "Average Daily Census", "Number of Nurses", "Available Facilities & Services")),
sliderInput("bandwidth", "Bandwidth:", min = 1, max = 1000, value = 1),
mainPanel(plotOutput("density_plots"))
)
server <- function(input, output, session) {
output$density_plots <- renderPlot({
if(length(input$variable) > 0) {
plots <- lapply(input$variable, function(col) {
p <- ggplot(data, aes(x = data[,col])) +
geom_density(bw = input$bandwidth) +
labs(x = col)
if(nrow(data[outlier_indices(data[,col]),]) > 0) {
p <- p + geom_point(data = data[outlier_indices(data[,col]),], aes(x = data[outlier_indices(data[,col]),col], y = 0), shape = "diamond", size = 4)
}
p
})
plot(arrangeGrob(grobs = plots))
}
})
}
# Run the application
shinyApp(ui = ui, server = server)
A higher bandwidth value makes the density plot more smoothed, while a lower bandwidth value makes the density plot closer to the actual, observed data.
The optimal bandwidth depend on the graph’s scale, but we estimated that the most suitable bandwidths for all graph are between 0.3 and 0.4 times the interquartile range of the variable.
library(ggplot2)
library(gridExtra)
library(plotly)
################
#### Step 1 ####
################
data <- read.delim("SENIC.txt", header = FALSE, sep = "")
names(data) <- c("Identification Number", "Length of Stay", "Age", "Infection Risk", "Routine Culturing Ratio", "Routine Chest X-ray Ratio", "Number of Beds", "Medical School Affiliation", "Region", "Average Daily Census", "Number of Nurses", "Available Facilities & Services")
################
#### Step 2 ####
################
outlier_indices <- function(column) {
q1 <- quantile(column)[2]
q3 <- quantile(column)[4]
return(column > q3+1.5*(q3-q1) | column < q1-1.5*(q3-q1))
}
################
#### Step 3 ####
################
p_density <- ggplot(data, aes(x = data[,4])) +
geom_density() +
labs(x = names(data)[4]) +
geom_point(data = data[outlier_indices(data[,4]),], aes(x = data[outlier_indices(data[,4]),4], y = 0), shape = "diamond", size = 4)
p_density
################
#### Step 4 ####
################
plots <- lapply(c(2:7, 10:12), function(col) {
p <- ggplot(data, aes(x = data[,col])) +
geom_density() +
labs(x = names(data)[col])
if(nrow(data[outlier_indices(data[,col]),]) > 0) {
p <- p + geom_point(data = data[outlier_indices(data[,col]),], aes(x = data[outlier_indices(data[,col]),col], y = 0), shape = "diamond", size = 4)
}
p
})
plot(arrangeGrob(grobs = plots))
################
#### Step 5 ####
################
p_scatter <- ggplot(data, aes(x = `Infection Risk`, y = `Number of Nurses`)) +
geom_point(aes(col = `Number of Beds`))
p_scatter
################
#### Step 6 ####
################
ggplotly(p_density)
################
#### Step 7 ####
################
fig <- data %>%
plot_ly(x = ~`Infection Risk`, type = "histogram") %>%
add_markers(x = ~`Infection Risk`[outlier_indices(data[,4])], y = 0, marker = list(symbol = "diamond", size = 10)) %>%
layout(xaxis = list(title = "Infection Risk"), yaxis = list(title = "Frequency"), showlegend = FALSE)
fig
Simon and Mohamed devised the whole assignment together, the main conceptual ideas and codes outline. Mohamed worked out Assignment 1, tree visualization, and the report creation using r markdown, Simon worked out Assignment 2 and carried out all codes and functions, including the r shiny App.
Comment:
The variables Age, Infection Risk, Routine Chest X-ray Ratio and Available Facilities & Services are approximately normally distributed. Age and Infection Risk both have outliers that are greater than Q3+1.5(Q3-Q1) and smaller than Q1-1.5(Q3-Q1). Routine Chest X-ray Ratio only have one outlier that is greater than Q3+1.5(Q3-Q1), and the variable Available Facilities & Services have no outliers.
The distributions of the variables Length of Stay, Routine Culturing Ratio, Number of Beds, Average Daily Census and Number of Nurses are all right-skewed. Observations whose values are greater than Q3+1.5(Q3-Q1) exists in all of these variables.
5. Scatter plot showing the dependence of Infection risk on the Number of Nurses where points are colored by Number of Beds